What drives the price of a car?¶

OVERVIEW
In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars; the provided subset contains information on 426K cars to ensure speed of processing. Your goal is to understand what factors make a car more or less expensive. As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.
CRISP-DM Framework¶
To frame the task, throughout our practical applications we will refer back to a standard process in industry for data projects called CRISP-DM. This process provides a framework for working through a data problem. Your first step in this application will be to read through a brief overview of CRISP-DM here. After reading the overview, answer the questions below.
Business Understanding¶
From a business perspective, we are tasked with identifying key drivers for used car prices. In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition. Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary.
The business objective is to identify the critical attributes in the vehicle dataset and build a model that can predict the price when those key attributes are supplied. The important car features will help salespeople match used cars to buyers. To achieve this, the vehicle dataset was downloaded from Kaggle, which gives insight into used-car attributes.
The success criteria for this project are to identify the key features and create a functional model that car dealers can use to drive used-car sales.
This project will use multiple open-source Python packages and APIs to develop the model.
Data Understanding¶
After considering the business understanding, we want to get familiar with our data. Write down some steps that you would take to get to know the dataset and identify any quality issues within. Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.
In this section, the focus will be on:
- Loading the vehicle dataset for analysis and model generation.
- Analyzing the numerical and categorical features and performing quality checks on the attributes.
- Identifying incomplete attributes that are irrelevant due to missing data and can be dropped.
#### Initial Data Collection ####
# The used-vehicle data was downloaded from Kaggle. It contains the attributes of used cars that were sold.
# The goal is to find the attributes that will help the salesperson identify what customers prefer in used cars.
## Import the Python libraries ##
import numpy as np
import numpy.ma as ma
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from plotly.subplots import make_subplots
from concurrent.futures import ThreadPoolExecutor
from category_encoders import TargetEncoder
from sklearn.experimental import enable_iterative_imputer  # noqa: enables IterativeImputer
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_transformer, make_column_selector
from sklearn.preprocessing import (LabelEncoder, OneHotEncoder, OrdinalEncoder,
                                   PolynomialFeatures, StandardScaler)
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.feature_selection import SequentialFeatureSelector, SelectFromModel
from sklearn.metrics import mean_squared_error
from sklearn.decomposition import PCA
from sklearn import set_config
set_config(display="diagram")
## Load the CSV file:
df = pd.read_csv("data/vehicles.csv")
## Print the sample dataset
df.head()
| id | region | price | year | manufacturer | model | condition | cylinders | fuel | odometer | title_status | transmission | VIN | drive | size | type | paint_color | state | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7222695916 | prescott | 6000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | az |
| 1 | 7218891961 | fayetteville | 11900 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ar |
| 2 | 7221797935 | florida keys | 21000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | fl |
| 3 | 7222270760 | worcester / central MA | 1500 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ma |
| 4 | 7210384030 | greensboro | 4900 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | nc |
## Description of dataset:
df.describe()
| id | price | year | odometer | |
|---|---|---|---|---|
| count | 4.268800e+05 | 4.268800e+05 | 425675.000000 | 4.224800e+05 |
| mean | 7.311487e+09 | 7.519903e+04 | 2011.235191 | 9.804333e+04 |
| std | 4.473170e+06 | 1.218228e+07 | 9.452120 | 2.138815e+05 |
| min | 7.207408e+09 | 0.000000e+00 | 1900.000000 | 0.000000e+00 |
| 25% | 7.308143e+09 | 5.900000e+03 | 2008.000000 | 3.770400e+04 |
| 50% | 7.312621e+09 | 1.395000e+04 | 2013.000000 | 8.554800e+04 |
| 75% | 7.315254e+09 | 2.648575e+04 | 2017.000000 | 1.335425e+05 |
| max | 7.317101e+09 | 3.736929e+09 | 2022.000000 | 1.000000e+07 |
## The vehicle dataset consists of the attributes listed below. Calculating the percentage of
## missing data per attribute reveals columns with very high missing rates.
# Print the dataset information
print(df.info())
# Count the missing values per column and express them as a percentage of all rows
md = pd.DataFrame(df.isnull().sum(), columns=['count']).sort_values(by='count', ascending=False)
md['percent'] = round(md['count'] / len(df) * 100, 1)
md = md.reset_index().rename(columns={'index': 'Car Attributes'})
# Plot the missing attributes in the dataset
fig = px.bar(md, x='Car Attributes', y='percent', text= 'percent', color='Car Attributes', title = '% of missing data per attribute')
fig.show()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 18 columns):
 #   Column        Non-Null Count   Dtype
---  ------        --------------   -----
 0   id            426880 non-null  int64
 1   region        426880 non-null  object
 2   price         426880 non-null  int64
 3   year          425675 non-null  float64
 4   manufacturer  409234 non-null  object
 5   model         421603 non-null  object
 6   condition     252776 non-null  object
 7   cylinders     249202 non-null  object
 8   fuel          423867 non-null  object
 9   odometer      422480 non-null  float64
 10  title_status  418638 non-null  object
 11  transmission  424324 non-null  object
 12  VIN           265838 non-null  object
 13  drive         296313 non-null  object
 14  size          120519 non-null  object
 15  type          334022 non-null  object
 16  paint_color   296677 non-null  object
 17  state         426880 non-null  object
dtypes: float64(2), int64(2), object(14)
memory usage: 58.6+ MB
None
## Identify the cumulative count of missing data per row. This will help identify the rows where the majority of data is missing, which can be dropped.
me = pd.DataFrame((df.isnull().sum(axis=1)),columns=['count'])
me = me.groupby(['count'])[['count']].count()
me['CV'] = me['count'].cumsum()
print(me)
## Print the bar chart
fig = px.bar(me, y='CV', text= 'CV', color='count', title = 'Cumulative count of missing data per row (rows with no missing data = 34868)')
fig.show()
| missing per row | count | CV |
|---|---|---|
| 0 | 34868 | 34868 |
| 1 | 78259 | 113127 |
| 2 | 99854 | 212981 |
| 3 | 91251 | 304232 |
| 4 | 46728 | 350960 |
| 5 | 21196 | 372156 |
| 6 | 19246 | 391402 |
| 7 | 30443 | 421845 |
| 8 | 4279 | 426124 |
| 9 | 125 | 426249 |
| 10 | 539 | 426788 |
| 11 | 24 | 426812 |
| 14 | 68 | 426880 |
Conclusion of Data Analysis¶
- The "size" column has 71.8% missing data, hence the column can be dropped.
- Drop the "id" column as it has little significance for the model or the customer.
- The "cylinders" column will not add value, as roughly 50% of its data is missing or unknown; hence this column can be dropped.
- The "condition" column can be significant to customers, hence we impute its missing data.
- Drop the "VIN" column as it does not contribute to customer preferences.
- As per the graph, some rows are missing many data points; drop the rows with more than 5 NaN values.
- Select the remaining columns and then apply a feature-selection model to identify the key features.
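One subtlety in the planned row-drop: pandas' `dropna(thresh=...)` keeps rows with at least `thresh` non-null values, so `thresh = len(df.columns) - 5` drops exactly the rows with more than 5 missing fields. A toy sketch of that semantics:

```python
import pandas as pd
import numpy as np

# Toy frame with 4 columns; thresh = n_cols - 2 keeps rows with at most 2 NaN
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, np.nan, 5.0],
    "c": [3.0, np.nan, np.nan],
    "d": [4.0, 9.0, 6.0],
})
# Require at least 2 non-NaN values per row
kept = toy.dropna(thresh=len(toy.columns) - 2)
print(kept.index.tolist())  # row 1 has only one non-NaN value and is dropped
```

Row 1 (a single non-null value) is removed while rows 0 and 2 survive, mirroring the "more than 5 NaN" rule on the 18-column vehicle frame.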
Data Preparation¶
After our initial exploration and fine-tuning of the business understanding, it is time to construct our final dataset before modeling. Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with sklearn.
The following data cleanup is done as per the previous data analysis to prepare the data for modelling:
- Drop the identified columns due to their high percentage of missing data
- Drop the rows having more than 5 NaN columns
- Remove junk characters from the dataset
- Use data-imputation techniques to fill in missing data for both numeric and categorical fields
- Encode the data so that it can be fed to various models
- Scale the dataset and split it into test and train datasets
- Analyse the correlation between features
- Run PCA to analyze the dataset's dimensionality
- Produce the final dataset for modelling
## Data cleanup: after deleting the unwanted columns, the rows with more than 5 missing values are deleted.
## The final chart shows the count of missing data per column; this missing data will be imputed.
##
# Drop the columns 'size', 'id', 'cylinders', 'VIN' due to missing data; they will not add much relevance to the model.
new_df = df.drop(['size','id','cylinders','VIN'], axis=1)
# Delete rows with more than 5 columns missing
new_df.dropna(thresh=len(new_df.columns) - 5,axis=0, inplace=True)
# Replace junk chars from the model column
pattern = r'[^a-zA-Z0-9\s]'
new_df['model'] = new_df['model'].replace(pattern, '', regex=True)
# Count of missing data per attribute
mg = pd.DataFrame(new_df.isnull().sum(), columns=['count']).sort_values(by='count', ascending=False)
display(mg)
# Plot the missing data by feature. This will help decide which columns to impute.
fig = px.bar(mg, x='count', text= 'count', color='count', width=1000, height=600 , title = 'Count of missing data per feature')
fig.show()
| count | |
|---|---|
| condition | 173269 |
| drive | 129733 |
| paint_color | 129410 |
| type | 92033 |
| manufacturer | 17427 |
| title_status | 7424 |
| model | 5209 |
| odometer | 3759 |
| fuel | 2186 |
| transmission | 1868 |
| year | 1128 |
| region | 0 |
| price | 0 |
| state | 0 |
## Data Imputation:
## Use SimpleImputer to fill in the missing data ##
X = df_imputed = new_df.copy()
impc = ['condition','drive','year','paint_color','type','manufacturer','title_status','model','fuel','transmission','region','state']
impn = ['odometer','price']
imputerc = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imputern = SimpleImputer(missing_values=np.nan, strategy='mean')
# Use the imputers to fill the null values in the specified columns
X[impc] = imputerc.fit_transform(X[impc])
X[impn] = imputern.fit_transform(X[impn])
X.isnull().sum()
cat_selector = make_column_selector(dtype_include=object)
## Cardinality of categorical columns in the data frame:
Ca = pd.DataFrame(X[cat_selector(X)].nunique(), columns=['count'])
print(Ca)
| attribute | count |
|---|---|
| region | 404 |
| year | 114 |
| manufacturer | 42 |
| model | 28946 |
| condition | 6 |
| fuel | 5 |
| title_status | 6 |
| transmission | 3 |
| drive | 3 |
| type | 13 |
| paint_color | 12 |
| state | 51 |
## Create a new copy of the dataset
X = df_imputed = new_df.copy()
## Select the numerical and categorical features from the dataset
cat_selector = make_column_selector(dtype_include=object)
numerical_features = num_selector = make_column_selector(dtype_include=np.number)
## Pipeline for data imputation using simple imputer most frequent strategy
cat_linear_processor = make_pipeline(
SimpleImputer(strategy='most_frequent')
)
## Pipeline for data imputation using simple imputer mean strategy
num_linear_processor = make_pipeline(
# StandardScaler(),
SimpleImputer(strategy='mean')
)
## Data Imputer Steps ##
dataImputer = ColumnTransformer(transformers=[
('numImputer(Strategy = Mean)', num_linear_processor, num_selector),
('catImputer(Strategy = Most Frequent)', cat_linear_processor, cat_selector),
], remainder='passthrough')
## Fit the data Imputer function
X = pd.DataFrame(dataImputer.fit_transform(X), columns=['price', 'year', 'odometer','region', 'manufacturer', 'model', 'condition', 'fuel', 'title_status', 'transmission', 'drive', 'type', 'paint_color', 'state'])
## The X dataframe now has imputed values
print(X.isnull().sum())
print(' ')
dataImputer
price           0
year            0
odometer        0
region          0
manufacturer    0
model           0
condition       0
fuel            0
title_status    0
transmission    0
drive           0
type            0
paint_color     0
state           0
dtype: int64
ColumnTransformer(remainder='passthrough',
                  transformers=[('numImputer(Strategy = Mean)',
                                 Pipeline(steps=[('simpleimputer',
                                                  SimpleImputer())]),
                                 <make_column_selector (numeric columns)>),
                                ('catImputer(Strategy = Most Frequent)',
                                 Pipeline(steps=[('simpleimputer',
                                                  SimpleImputer(strategy='most_frequent'))]),
                                 <make_column_selector (categorical columns)>)])
## Function for Target encoder
def highCardinality_encode_column(dfm, column):
encoder = TargetEncoder()
return encoder.fit_transform(dfm[[column]], dfm['price'])
# List of columns to encode
high_cardinality_features = ['region', 'manufacturer', 'condition','fuel', 'title_status', 'drive', 'state']
low_cardinality_features = ['transmission', 'type', 'paint_color']
encoded_results = {}
# Use ThreadPoolExecutor for parallel processing
with ThreadPoolExecutor() as executor:
futures = {executor.submit(highCardinality_encode_column, X, col): col for col in high_cardinality_features}
for future in futures:
col = futures[future]
encoded_results[col] = future.result()
# Combine encoded results into the original DataFrame
for col, encoded in encoded_results.items():
X[col + '_encoded'] = encoded
## The Target encoder function had trouble encoding the model column, hence it is encoded manually.
mean_encoded = X.groupby('model')['price'].mean()
X['model_encoded'] = X['model'].map(mean_encoded)
## Encode the low-cardinality features using frequency (count) encoding
for col in low_cardinality_features:
frequency_encoded = X[col].value_counts()
X[col + '_encoded'] = X[col].map(frequency_encoded)
# Display the DataFrame with the encoded features
X.head()
| price | year | odometer | region | manufacturer | model | condition | fuel | title_status | transmission | ... | manufacturer_encoded | condition_encoded | fuel_encoded | title_status_encoded | drive_encoded | state_encoded | model_encoded | transmission_encoded | type_encoded | paint_color_encoded | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 33590.0 | 2014.0 | 57923.0 | auburn | gmc | sierra 1500 crew cab slt | good | gas | clean | other | ... | 30426.023100 | 71012.649463 | 73584.761409 | 77315.43855 | 107316.446103 | 239642.53219 | 35224.934498 | 62672 | 43510 | 208693 |
| 1 | 22590.0 | 2010.0 | 71229.0 | auburn | chevrolet | silverado 1500 | good | gas | clean | other | ... | 115820.592547 | 71012.649463 | 73584.761409 | 77315.43855 | 107316.446103 | 239642.53219 | 20619.683389 | 62672 | 43510 | 31223 |
| 2 | 39590.0 | 2020.0 | 19160.0 | auburn | chevrolet | silverado 1500 crew | good | gas | clean | other | ... | 115820.592547 | 71012.649463 | 73584.761409 | 77315.43855 | 107316.446103 | 239642.53219 | 34064.27593 | 62672 | 43510 | 30460 |
| 3 | 30990.0 | 2017.0 | 41124.0 | auburn | toyota | tundra double cab sr | good | gas | clean | other | ... | 235060.628139 | 71012.649463 | 73584.761409 | 77315.43855 | 107316.446103 | 239642.53219 | 34749.481707 | 62672 | 43510 | 30460 |
| 4 | 15000.0 | 2013.0 | 128000.0 | auburn | ford | f150 xlt | excellent | gas | clean | automatic | ... | 35929.010235 | 51346.825953 | 73584.761409 | 77315.43855 | 40796.308366 | 239642.53219 | 18396.925068 | 338265 | 35279 | 62859 |
5 rows × 25 columns
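The manual mean encoding of `model` above maps each model to its average price, which can overfit rare models: a model seen once simply gets its own price back. A minimal sketch of a smoothed variant that shrinks small groups toward the global mean; the helper name `smoothed_target_encode` and the weight `m` are illustrative, not part of this notebook:

```python
import pandas as pd

def smoothed_target_encode(df, col, target, m=10):
    """Mean-encode `col` against `target`, shrinking small groups
    toward the global mean to reduce overfitting on rare categories."""
    global_mean = df[target].mean()
    stats = df.groupby(col)[target].agg(["mean", "count"])
    smooth = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
    return df[col].map(smooth)

# Hypothetical mini-frame standing in for the vehicle data
demo = pd.DataFrame({"model": ["a", "a", "b"], "price": [10.0, 20.0, 30.0]})
enc = smoothed_target_encode(demo, "model", "price", m=0)  # m=0 reproduces the plain group mean
print(enc.tolist())  # → [15.0, 15.0, 30.0]
```

With `m > 0`, a model with few listings is pulled toward the overall average price instead of memorizing its handful of sale prices.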
#### Create Test and Train datasets for the Model #####
X_transformed = X[['price','year','odometer','region_encoded', 'manufacturer_encoded', 'condition_encoded','fuel_encoded', 'title_status_encoded', 'drive_encoded', 'state_encoded','transmission_encoded', 'type_encoded', 'paint_color_encoded']]
# Scale the dataset (z-score per column)
X_scaled = (X_transformed - X_transformed.mean())/(X_transformed.std())
X_t = X_scaled.drop(columns = 'price')
y_t = X_scaled['price']
# Split into train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X_t, y_t, test_size = 0.3, random_state = 42)
X_train.head()
| year | odometer | region_encoded | manufacturer_encoded | condition_encoded | fuel_encoded | title_status_encoded | drive_encoded | state_encoded | transmission_encoded | type_encoded | paint_color_encoded | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 158946 | 0.187207 | 0.082042 | -0.197052 | -0.529538 | -0.049189 | -0.126011 | 0.180677 | 0.779027 | -0.339408 | 0.508288 | 1.127212 | 1.008512 |
| 69643 | 0.61074 | -0.223239 | -0.168788 | 0.386326 | -0.049189 | -0.613784 | 0.180677 | 0.779027 | 0.257239 | 0.508288 | -0.804167 | 1.008512 |
| 33466 | -0.024559 | 0.384802 | -0.198494 | -0.529538 | -0.273124 | -0.126011 | 0.180677 | 0.779027 | 0.257239 | 0.508288 | 1.127212 | -0.689043 |
| 311528 | -0.236326 | 0.107359 | -0.186722 | 4.355051 | -0.049189 | 3.089299 | 0.180677 | 0.779027 | 0.914641 | 0.508288 | -0.323078 | 1.008512 |
| 176464 | 0.187207 | 0.084479 | -0.203122 | 0.720762 | -0.273124 | -0.126011 | 0.180677 | 0.779027 | -0.329358 | -2.189783 | -0.323078 | -1.343997 |
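The z-score scaling above is equivalent to sklearn's `StandardScaler`, with one caveat: pandas' `.std()` defaults to the sample standard deviation (`ddof=1`) while `StandardScaler` divides by the population version (`ddof=0`). A small sketch on hypothetical numbers showing the match when `ddof=0` is used:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical two-feature frame standing in for X_transformed
demo = pd.DataFrame({"year": [2010.0, 2014.0, 2018.0],
                     "odometer": [90000.0, 60000.0, 30000.0]})

manual = (demo - demo.mean()) / demo.std(ddof=0)  # population std, as sklearn uses
scaled = StandardScaler().fit_transform(demo)     # same z-scores

print(np.allclose(manual.to_numpy(), scaled))  # → True
```

Note that fitting the scaler (or the manual mean/std) on the full dataset before the train/test split lets test-set statistics leak into training; fitting on the training split only avoids this.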
## Plot the correlation matrix of the feature columns.
corr = X_train.corr()
plt.figure(figsize=(10,8))
sns.heatmap(corr,
xticklabels=corr.columns.values,
yticklabels=corr.columns.values,
cmap="coolwarm",
vmin=-1,
vmax=1,
annot=True,
fmt='.2f')
plt.title("Correlation Heatmap of vehicle dataset")
plt.show()
corr.head(200)
| year | odometer | region_encoded | manufacturer_encoded | condition_encoded | fuel_encoded | title_status_encoded | drive_encoded | state_encoded | transmission_encoded | type_encoded | paint_color_encoded | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| year | 1.000000 | -0.155131 | -0.017714 | -0.018380 | -0.174635 | -0.040043 | 0.030979 | 0.019509 | -0.008890 | -0.009237 | -0.054727 | 0.042891 |
| odometer | -0.155131 | 1.000000 | 0.014160 | 0.001405 | 0.068618 | 0.051401 | -0.011579 | 0.018927 | 0.003487 | 0.067937 | 0.037033 | 0.006945 |
| region_encoded | -0.017714 | 0.014160 | 1.000000 | -0.000704 | 0.001167 | 0.010083 | 0.006737 | -0.004000 | 0.582148 | 0.004632 | 0.012009 | 0.003770 |
| manufacturer_encoded | -0.018380 | 0.001405 | -0.000704 | 1.000000 | -0.006395 | -0.063489 | 0.014712 | 0.046917 | 0.002449 | 0.037917 | 0.004149 | -0.017659 |
| condition_encoded | -0.174635 | 0.068618 | 0.001167 | -0.006395 | 1.000000 | 0.014655 | -0.026491 | -0.015792 | 0.003404 | -0.025620 | -0.008420 | -0.031176 |
| fuel_encoded | -0.040043 | 0.051401 | 0.010083 | -0.063489 | 0.014655 | 1.000000 | 0.005178 | 0.147375 | 0.011325 | 0.082067 | -0.040669 | 0.085349 |
| title_status_encoded | 0.030979 | -0.011579 | 0.006737 | 0.014712 | -0.026491 | 0.005178 | 1.000000 | 0.015577 | 0.007637 | -0.034818 | -0.039897 | 0.023134 |
| drive_encoded | 0.019509 | 0.018927 | -0.004000 | 0.046917 | -0.015792 | 0.147375 | 0.015577 | 1.000000 | -0.009967 | 0.027020 | -0.035340 | 0.178237 |
| state_encoded | -0.008890 | 0.003487 | 0.582148 | 0.002449 | 0.003404 | 0.011325 | 0.007637 | -0.009967 | 1.000000 | -0.002688 | 0.012196 | 0.011269 |
| transmission_encoded | -0.009237 | 0.067937 | 0.004632 | 0.037917 | -0.025620 | 0.082067 | -0.034818 | 0.027020 | -0.002688 | 1.000000 | 0.149585 | 0.048032 |
| type_encoded | -0.054727 | 0.037033 | 0.012009 | 0.004149 | -0.008420 | -0.040669 | -0.039897 | -0.035340 | 0.012196 | 0.149585 | 1.000000 | 0.154249 |
| paint_color_encoded | 0.042891 | 0.006945 | 0.003770 | -0.017659 | -0.031176 | 0.085349 | 0.023134 | 0.178237 | 0.011269 | 0.048032 | 0.154249 | 1.000000 |
## Execute PCA on the dataframe and plot the number of components vs. the explained-variance ratio
# Iterate from 1 up to the number of features in the dataframe
iterates = np.arange(1, 13)
var_ratio = []
# Execute the for loop and store the cumulative variance ratio for each run
for iterate in iterates:
    pca = PCA(n_components=iterate)
    pca.fit(X_t)
    var_ratio.append(np.sum(pca.explained_variance_ratio_))
#Plot the figure
plt.figure(figsize=(10,8),dpi=150)
plt.grid()
plt.plot(iterates,var_ratio,marker='o')
plt.xlabel('n_components')
plt.ylabel('Explained variance ratio')
plt.title('n_components vs. Explained Variance Ratio')
Text(0.5, 1.0, 'n_components vs. Explained Variance Ratio')
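The loop above refits PCA once per candidate `n_components`. The same curve can be read off a single full fit via the cumulative sum of `explained_variance_ratio_`; a sketch on synthetic data standing in for the 12 scaled features:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 12))  # stand-in for the 12 scaled features

# One full fit yields per-component ratios; their cumulative sum gives the
# same curve as refitting PCA once for every n_components value.
pca = PCA(n_components=12).fit(X_demo)
cum_ratio = np.cumsum(pca.explained_variance_ratio_)
print(round(float(cum_ratio[-1]), 6))  # → 1.0 (all components explain all variance)
```

This reduces the diagnostic from twelve fits to one, which matters on a frame with hundreds of thousands of rows.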
Data preparation conclusion¶
Improved the data quality of the vehicle dataset:
- Removed unwanted columns and rows with high proportions of missing values
- Imputed the missing data using the mean and most-frequent strategies
- Removed junk characters from the columns
- Scaled the dataset

Generated a correlation heatmap and executed PCA on all the features:
- The results show that the features are not closely related, except for state & region
- The explained-variance curve converges only slowly, therefore all features need to be examined in the model

The final dataset is ready for model execution.
Modeling¶
With your (almost?) final dataset in hand, it is now time to build some models. Here, you should build a number of different regression models with the price as the target. In building your models, you should explore different parameters and be sure to cross-validate your findings.
Generated the following models to evaluate the vehicle dataframe:
- Linear Regression with sequential feature selection
- Ridge Regression with polynomial features
- Linear Regression with polynomial features and Lasso feature selection
- Lasso Regression with polynomial features

Calculated the MSE on the test and training datasets for comparison.
#### Model_1 : Linear Regression with Sequential Feature Selection ####
# Selector pipeline to run the selector & Linear Regression
selector_pipe = Pipeline([('selector', SequentialFeatureSelector(LinearRegression())),
('model', LinearRegression())])
# Print Pipeline steps
selector_pipe
Pipeline(steps=[('selector',
                 SequentialFeatureSelector(estimator=LinearRegression())),
                ('model', LinearRegression())])
#### Execute Model_1 and extract the mean squared error for the training & test datasets ####
# Param grid to iterate over several feature counts and select the best using grid-search cross-validation
param_dict = {'selector__n_features_to_select': [2, 4, 6, 10]}
selector_grid = GridSearchCV(selector_pipe, param_grid=param_dict)
# Fit, then predict on the training and test datasets:
selector_grid.fit(X_train, y_train)
train_preds = selector_grid.predict(X_train)
test_preds = selector_grid.predict(X_test)
# Calculate the MSE for the training and test datasets:
selector_train_mse = mean_squared_error(y_train, train_preds)
selector_test_mse = mean_squared_error(y_test, test_preds)
# Print the values:
print(f'Train MSE: {selector_train_mse}')
print(f'Test MSE: {selector_test_mse}')
Train MSE: 1.160192176358472
Test MSE: 0.6242452522600143
#### Model_2: Ridge Regression with polynomial features ####
ridge_param_dict = {'ridge__alpha': np.logspace(0, 10, 50)}
# Prepare the ridge pipeline with polynomial features
ridge_pipe = Pipeline([
('poly_features', PolynomialFeatures(degree = 3, include_bias = False)),
('ridge', Ridge())])
#Print the pipeline steps
print('## RIDGE REGRESSION MODEL ##')
ridge_pipe
## RIDGE REGRESSION MODEL ##
Pipeline(steps=[('poly_features',
                 PolynomialFeatures(degree=3, include_bias=False)),
                ('ridge', Ridge())])
## Tune alpha using grid-search cross-validation
ridge_grid = GridSearchCV(ridge_pipe, param_grid=ridge_param_dict)
# Fit the pipeline, predict on the train & test datasets, and calculate the MSEs
ridge_grid.fit(X_train, y_train)
ridge_train_preds = ridge_grid.predict(X_train)
ridge_test_preds = ridge_grid.predict(X_test)
# Calculate the mean squared errors
ridge_train_mse = mean_squared_error(y_train, ridge_train_preds)
ridge_test_mse = mean_squared_error(y_test, ridge_test_preds)
# Print the MSE values for Test and Train
print(f'Train MSE: {ridge_train_mse}')
print(f'Test MSE: {ridge_test_mse}')
Train MSE: 1.1599686178671338
Test MSE: 0.6245355741141243
#### MODEL_3 Linear regression Model with Lasso feature selection ####
model_selector_pipe = Pipeline([('poly_features', PolynomialFeatures(degree = 3, include_bias = False)),
('selector', SelectFromModel(Lasso())),
('linreg', LinearRegression())])
# Print the model
print('## LINEAR REGRESSION MODEL WITH LASSO SELECTOR ##')
model_selector_pipe
## LINEAR REGRESSION MODEL WITH LASSO SELECTOR ##
Pipeline(steps=[('poly_features',
                 PolynomialFeatures(degree=3, include_bias=False)),
                ('selector', SelectFromModel(estimator=Lasso())),
                ('linreg', LinearRegression())])
#### Model Execution ####
# Fit the pipeline and calculate the mean squared error for the train & test splits
fx = model_selector_pipe.fit(X_train, y_train)
selector_train_mse = mean_squared_error(y_train, model_selector_pipe.predict(X_train))
selector_test_mse = mean_squared_error(y_test, model_selector_pipe.predict(X_test))
# Print the train & test MSEs
print(selector_train_mse)
print(selector_test_mse)
1.153049416785618
0.6285839373470016
## Identify the selected features by the model and get the feature coefficients
selector = model_selector_pipe.named_steps['selector']
# Get the mask of selected features
selected_features_mask = selector.get_support()
# Get selected feature names from the polynomial features
poly_features = model_selector_pipe.named_steps['poly_features']
selected_feature_names = pd.DataFrame(poly_features.get_feature_names_out()[selected_features_mask])
selected_feature_names.columns=['Feature Name']
# Get the feature coefficients
coef= model_selector_pipe.named_steps['selector'].estimator_.coef_
coef_sel = pd.DataFrame(coef[selected_features_mask])
coef_sel.columns =['Coefficient']
fs = pd.concat((selected_feature_names, coef_sel), axis=1)
print(fs)
| | Feature Name | Coefficient |
|---|---|---|
| 0 | year odometer^2 | -0.000054 |
| 1 | region_encoded^3 | 0.000093 |
| 2 | manufacturer_encoded state_encoded^2 | 0.000951 |
| 3 | state_encoded^3 | 0.000034 |
#### Model_4: Lasso Regression Model ####
# Create a pipeline with polynomial features and Lasso regression
auto_pipe = Pipeline([
('polyfeatures', PolynomialFeatures(degree = 3, include_bias = False)),
# ('selector', SequentialFeatureSelector(LinearRegression(), n_features_to_select='auto')) ,
('lasso', Lasso(random_state = 42))
])
## Print the model steps
auto_pipe
Pipeline(steps=[('polyfeatures',
                 PolynomialFeatures(degree=3, include_bias=False)),
                ('lasso', Lasso(random_state=42))])
## Model_4 Lasso Regression execution
# Fit the model
auto_pipe.fit(X_train, y_train)
lasso_coefs = auto_pipe.named_steps['lasso'].coef_
# Calculate the Test and Train MSE's
lasso_train_mse = mean_squared_error(y_train, auto_pipe.predict(X_train))
lasso_test_mse = mean_squared_error(y_test, auto_pipe.predict(X_test))
# Print the Lasso MSES
print(lasso_train_mse)
print(lasso_test_mse)
# features names selected by the Model and lasso coefficient
feature_names = auto_pipe.named_steps['polyfeatures'].get_feature_names_out()
lasso_df = pd.DataFrame({'feature': feature_names, 'coef': lasso_coefs})
# Print Feature names
print(type(feature_names))
lasso_df.loc[lasso_df['coef'] != 0]
1.158084602773126
0.6246889064643456
<class 'numpy.ndarray'>
| feature | coef | |
|---|---|---|
| 102 | year odometer^2 | -0.000054 |
| 168 | odometer^3 | 0.000004 |
| 171 | odometer^2 condition_encoded | -0.000002 |
| 234 | region_encoded^3 | 0.000093 |
| 324 | manufacturer_encoded state_encoded^2 | 0.000951 |
| 434 | state_encoded^3 | 0.000034 |
## Find the best model and number of features selected by the Sequential selector along with features coefficients
best_estimator = selector_grid.best_estimator_
best_selector = best_estimator.named_steps['selector']
best_model = selector_grid.best_estimator_.named_steps['model']
feature_names = X_train.columns[best_selector.get_support()]
coefs = best_model.coef_
# Print best estimator
print(best_estimator)
print(f'Features from best selector: {feature_names}.')
print('Coefficient values: ')
print('===================')
pd.DataFrame([coefs.T], columns = feature_names, index = ['model'])
Pipeline(steps=[('selector',
SequentialFeatureSelector(estimator=LinearRegression(),
n_features_to_select=2)),
('model', LinearRegression())])
Features from best selector: Index(['region_encoded', 'fuel_encoded'], dtype='object').
Coefficient values:
===================
| region_encoded | fuel_encoded | |
|---|---|---|
| model | 0.029021 | 0.001582 |
Evaluation¶
With some modeling accomplished, we aim to reflect on what we identify as a high quality model and what we are able to learn from this. We should review our business objective and explore how well we can provide meaningful insight on drivers of used car prices. Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.
## MSE comparison between the various models (Train/Test)
##
MSE = {
'Model': ['Linear Regression', 'Ridge Regression', 'Linear Regression(Lasso selector)', 'Lasso Regression'],
'Train MSE': [1.160192176358472, 1.1602148172330142, 1.153049416785618, 1.158084602773126],
'Test MSE': [0.6242452522600143, 0.6241035347444294, 0.6285839373470016, 0.6246889064643456]
}
MSED = pd.DataFrame(MSE)
fig = px.scatter(MSED, x='Model', y='Train MSE', color='Model', size='Train MSE', title = 'Mean Squared Error for Model based on Train dataset')
fig1 = px.scatter(MSED, x='Model', y='Test MSE', color='Model',size='Test MSE', title = 'Mean Squared Error for Model based on Test dataset')
fig.show()
fig1.show()
MSED.head(4)
| Model | Train MSE | Test MSE | |
|---|---|---|---|
| 0 | Linear Regression | 1.160192 | 0.624245 |
| 1 | Ridge Regression | 1.160215 | 0.624104 |
| 2 | Linear Regression(Lasso selector) | 1.153049 | 0.628584 |
| 3 | Lasso Regression | 1.158085 | 0.624689 |
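The table above compares models on one fixed split. As the Modeling brief suggests, k-fold cross-validation gives a more robust comparison; a minimal sketch on synthetic data (the shapes merely stand in for the prepared vehicle frame):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the 12 scaled vehicle features
X_demo, y_demo = make_regression(n_samples=300, n_features=12,
                                 noise=10.0, random_state=42)

for name, model in [("linear", LinearRegression()),
                    ("ridge", Ridge(alpha=1.0)),
                    ("lasso", Lasso(alpha=0.1))]:
    # 5-fold CV; sklearn returns negated MSE so higher is better
    scores = cross_val_score(model, X_demo, y_demo,
                             scoring="neg_mean_squared_error", cv=5)
    print(name, round(-scores.mean(), 2))
```

Averaging the fold-wise MSEs reduces the chance that one lucky or unlucky split (such as the unusually low test MSE here relative to the train MSE) drives the model choice.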
## The error comparison shows that Linear Regression with the Lasso selector performs well on both the test & train datasets
## Plot the selected features & their importance coefficients
fig = px.bar(fs, x='Feature Name', y='Coefficient', color='Feature Name',
width=800,
height=600,
title = 'Feature importance based on coefficients')
fig.show()
Conclusion¶
Evaluating the MSE on the scaled training & test data, Linear Regression with the Lasso selector and polynomial features gives better results than the other models. We therefore select this model for the final car-price evaluation and present it to the car dealers.
Deployment¶
Now that we've settled on our models and findings, it is time to deliver the information to the client. You should organize your work as a basic report that details your primary findings. Keep in mind that your audience is a group of used car dealers interested in fine tuning their inventory.
The key features for dealers to focus on, which can command better car prices, are:
- Car manufacturer
- Region
- State
- Year of manufacture
- Low odometer reading
Other attributes to consider are:
- Fuel type
- Condition of the car
Car price depends directly on the manufacturer and the state where the car is sold; used-car inventory can be arranged around such manufacturers.
Also, a low odometer reading and a recent year of manufacture go together and are associated with a higher car price.
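As a possible next step for the dealership, the chosen model could back a simple inventory-ranking helper. A minimal sketch on hypothetical numbers (not the fitted notebook pipeline), using only `year` and `odometer`:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical mini training frame: year and odometer vs. sale price
train = pd.DataFrame({
    "year": [2010, 2014, 2018, 2020],
    "odometer": [120000, 85000, 40000, 20000],
    "price": [6000, 12000, 22000, 30000],
})
pipe = Pipeline([("scale", StandardScaler()), ("reg", LinearRegression())])
pipe.fit(train[["year", "odometer"]], train["price"])

# Rank candidate inventory by predicted sale price
stock = pd.DataFrame({"year": [2012, 2019], "odometer": [100000, 30000]})
stock["predicted_price"] = pipe.predict(stock[["year", "odometer"]])
ranked = stock.sort_values("predicted_price", ascending=False)
print(ranked["year"].tolist())  # newer, low-mileage car ranks first
```

A dealer could feed candidate trade-ins through such a helper to prioritize the newer, low-odometer stock the analysis above identifies as commanding higher prices.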